Language-specific Taggers and Parsers in the OPUS corpus

Language-specific tools for tagging and parsing the OPUS corpus have been collected and are available for download at DownloadTools. To keep the procedure consistent, the same tools are used for most languages: the Hunpos tagger (Péter Halácsy, András Kornai, Csaba Oravecz, 2007, Hunpos - an open source trigram tagger) and Maltparser (Joakim Nivre and Johan Hall, 2005, Maltparser: A language-independent system for data-driven dependency parsing). Some languages use alternative taggers and/or parsers, as noted in the sections below.
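
The per-language choices documented on this page can be summarised as a small lookup table. The following is a sketch in Python, with tool names taken from the language sections of this page. None marks a tool that this page does not list for that language, and the names Tools, TOOLS, languages_using and non_default_parsers are illustrative only, not part of any OPUS software.

```python
from typing import Dict, List, NamedTuple, Optional


class Tools(NamedTuple):
    tagger: Optional[str]  # None: no tagger is listed on this page
    parser: Optional[str]  # None: no parser is listed on this page


# Tool names as given in the language sections of this page.
TOOLS: Dict[str, Tools] = {
    "Czech": Tools("Hunpos", "Maltparser"),
    "Chinese": Tools("Zpar", "Zpar"),
    "Danish": Tools("Hunpos", "Maltparser"),
    "Dutch": Tools(None, "Maltparser"),
    "English": Tools("Hunpos", "Maltparser"),
    "French": Tools("MElt", "Maltparser"),
    "German": Tools(None, "Maltparser"),
    "Hungarian": Tools("Hunpos", None),
    "Italian": Tools("TextPro", "Maltparser"),
    "Portuguese": Tools("Hunpos", "Maltparser"),
    "Russian": Tools("Hunpos", "Maltparser"),
    "Slovene": Tools("Hunpos", "Maltparser"),
    "Spanish": Tools("SVMTool", "Maltparser"),
    "Swedish": Tools("Hunpos", "Maltparser"),
    "Turkish": Tools("Oflazer analyser + Yüret/Türe disambiguator", "Maltparser"),
}


def languages_using(tagger: str) -> List[str]:
    """Return the languages whose listed tagger matches the given name."""
    return sorted(lang for lang, t in TOOLS.items() if t.tagger == tagger)


def non_default_parsers() -> List[str]:
    """Return the languages whose listed parser is not Maltparser."""
    return sorted(lang for lang, t in TOOLS.items() if t.parser != "Maltparser")
```

For example, languages_using("Hunpos") returns the languages tagged with the default tagger, and non_default_parsers() returns those that are parsed with another tool or have no parser listed.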

Czech

The tagger used for tagging Czech texts is the Hunpos tagger, trained on the Prague Dependency Treebank (PDT).

The parser used for parsing Czech texts is Maltparser, trained on the Prague Dependency Treebank (PDT). The Czech parsing model was provided by Marco Kuhlmann, Uppsala University (Marco Kuhlmann and Joakim Nivre, 2010, Transition-Based Techniques for Non-Projective Dependency Parsing).

Chinese

For Chinese, the Zpar parser is used for segmentation, tagging and parsing. The Chinese model was downloaded from http://sourceforge.net/projects/zpar/files/0.4/.

Danish

The tagger used for tagging Danish texts is the Hunpos tagger, trained on the Danish Dependency Treebank (http://www.id.cbs.dk/~mtk/treebank/).

The parser used for parsing Danish texts is Maltparser, trained on the Danish Dependency Treebank (DDT). Optimized settings were provided by Joakim Nivre, Uppsala University.

Dutch

The parser used for parsing Dutch texts is Maltparser, trained on the CDB corpus, i.e. the newspaper part of the Alpino Treebank (van Noord, 2006). The Dutch parsing model was provided by Barbara Plank, University of Groningen.

English

The tagger used for tagging English texts is the Hunpos tagger, trained on the Wall Street Journal section of the Penn Treebank. The English tagging model was downloaded from http://code.google.com/p/hunpos/downloads/list.

The parser used for parsing English texts is Maltparser, trained on the Wall Street Journal section of the Penn Treebank extended with about 4000 questions from the Question Bank, converted to dependency trees using the Stanford Parser. The English parsing model was downloaded from http://maltparser.org/mco/english_parser/engmalt.html.
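
The two English steps described above can be chained roughly as follows. This is a minimal sketch, not the OPUS pipeline itself. It assumes that hunpos-tag reads one token per line on standard input, with a blank line between sentences, and writes token and tag separated by a tab, and that Maltparser is run in parse mode on CoNLL-X input with the unknown columns filled with underscores. The function names, the jar file name and the model names are hypothetical placeholders.

```python
from typing import List, Tuple

Sentence = List[str]
TaggedSentence = List[Tuple[str, str]]


def to_hunpos_input(sentences: List[Sentence]) -> str:
    """Write one token per line, with a blank line after each sentence."""
    return "".join("\n".join(tokens) + "\n\n" for tokens in sentences)


def parse_hunpos_output(text: str) -> List[TaggedSentence]:
    """Read 'token<TAB>tag' lines; a blank line closes a sentence."""
    sentences: List[TaggedSentence] = []
    current: TaggedSentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")[:2]
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def to_conllx(sentences: List[TaggedSentence]) -> str:
    """Emit 10-column CoNLL-X rows; columns not known before parsing are '_'."""
    out = []
    for sent in sentences:
        for i, (form, tag) in enumerate(sent, start=1):
            # ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL
            out.append("\t".join([str(i), form, "_", tag, tag, "_", "_", "_", "_", "_"]))
        out.append("")  # blank line ends the sentence
    return "\n".join(out) + "\n"


def hunpos_command(model_path: str) -> List[str]:
    """hunpos-tag takes the model file as its argument and tags stdin."""
    return ["hunpos-tag", model_path]


def maltparser_command(jar: str, model: str, infile: str, outfile: str) -> List[str]:
    """Maltparser in parse mode: -c model name, -i input, -o output."""
    return ["java", "-jar", jar, "-c", model, "-i", infile, "-o", outfile, "-m", "parse"]
```

The command lists are only constructed here, not executed, so the sketch runs without either tool installed. They can be passed to subprocess.run once the tools and models have been downloaded.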

French

The tagger used for tagging French texts is the MElt tagger (Denis and Sagot, 2009, Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort). The French tagging model was downloaded from https://gforge.inria.fr/frs/download.php/27240/melt-0.6.tar.gz.

The parser used for parsing French texts is Maltparser, trained on a dependency version of the French Treebank. The French parsing model was downloaded from http://maltparser.org/mco/french_parser/fremalt.html.

German

The parser used for parsing German texts is Maltparser, trained on the Tiger Treebank. The German parsing model was provided by Marco Kuhlmann, Uppsala University (Marco Kuhlmann and Joakim Nivre, 2010, Transition-Based Techniques for Non-Projective Dependency Parsing).

Hungarian

The tagger used for tagging Hungarian texts is the Hunpos tagger. The Hungarian tagging model was downloaded from http://code.google.com/p/hunpos/downloads/list.

Italian

Pre-processing tools and taggers for Italian are bundled in TextPro. The parser is trained with Maltparser.

Portuguese

The tagger used for tagging Portuguese texts is the Hunpos tagger, trained on the Floresta corpus.

The parser used for parsing Portuguese texts is Maltparser, trained on the Floresta corpus.

Russian

The tagger used for tagging Russian texts is the Hunpos tagger.

The parser used for parsing Russian texts is Maltparser.

Slovene

The tagger used for tagging Slovene texts is the Hunpos tagger, trained on the jos100k corpus, version 2.0. Training data was provided by Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute.

The parser used for parsing Slovene texts is Maltparser, trained on the jos100k corpus, version 2.0. Training data was provided by Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute.

Spanish

The tagger used for tagging Spanish texts is the SVMTool (Jesús Giménez and Lluis Màrquez, 2004, SVMTool: A general POS tagger generator based on Support Vector Machines), trained on the Ancora corpus. The Spanish tagging model was provided by Jesús Giménez, Universitat Politècnica de Catalunya Barcelona Tech.

The parser used for parsing Spanish texts is Maltparser, trained on the Ancora corpus. The Spanish parsing model was provided by Jesús Giménez.

Swedish

The tagger used for tagging Swedish texts is the Hunpos tagger, trained on the SUC corpus, version 2.0. The Swedish tagging model was provided by Uppsala University.

The parser used for parsing Swedish texts is Maltparser, trained on the Talbanken section of the Swedish Treebank. The Swedish parsing model was downloaded from http://maltparser.org/mco/swedish_parser/swemalt.html.

Turkish

For the morpho-syntactic annotation of Turkish texts, a morphological segmenter and analyser developed by Kemal Oflazer is used (Kemal Oflazer, 1994, Two-level description of Turkish morphology). The analyser leaves ambiguous tokens unresolved, so disambiguation is performed afterwards using the disambiguator described by Yüret and Türe (Deniz Yüret and Ferhan Türe, 2006, Learning morphological disambiguation rules for Turkish).

The parser used for parsing Turkish texts is Maltparser, with a pre-trained Turkish model provided by Gülsen Eryigit, İstanbul Teknik Üniversitesi (G. Eryigit, J. Nivre and K. Oflazer, 2006, The incremental use of morphological information and lexicalization in data-driven dependency parsing).

Last modified on Nov 3, 2013, 12:25:40 PM